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Abstract 


To improve the efficiency of Monte Carlo estimation, practitioners are turning to 
biased Markov chain Monte Carlo procedures that trade off asymptotic exactness 
for computational speed. The reasoning is sound; a reduction in variance due to 
more rapid sampling can outweigh the bias introduced. However, the inexactness 
creates new challenges for sampler and parameter selection, since standard mea¬ 
sures of sample quality like effective sample size do not account for asymptotic 
bias. To address these challenges, we introduce a new computable quality measure 
based on Stein’s method that quantifies the maximum discrepancy between sam¬ 
ple and target expectations over a large class of test functions. We use our tool to 
compare exact, biased, and deterministic sample sequences and illustrate applica¬ 
tions to hyperparameter selection, convergence rate assessment, and quantifying 
bias-variance tradeoffs in posterior inference. 


1 Introduction 

When faced with a complex target distribution, one often turns to Markov chain Monte Carlo 
(MCMC) 111 to approximate intractable expectations Ep[/i(Z)] = j^ p{x)h{x)dx with asymp¬ 
totically exact sample estimates Eq[/i(X)] = d{xi)h{xi). These complex targets commonly 

arise as posterior distributions in Bayesian inference and as candidate distributions in maximum 
likelihood estimation m- In recent years, researchers [e.g., Ennia have introduced asymptotic bias 
into MCMC procedures to trade off asymptotic correctness for improved sampling speed. The ra¬ 
tionale is that more rapid sampling can reduce the variance of a Monte Carlo estimate and hence 
outweigh the bias introduced. However, the added flexibility introduces new challenges for sampler 
and parameter selection, since standard sample quality measures, like effective sample size, asymp¬ 
totic variance, trace and mean plots, and pooled and within-chain variance diagnostics, presume 
eventual convergence to the target IT] and hence do not account for asymptotic bias. 

To address this shortcoming, we develop a new measure of sample quality suitable for comparing 
asymptotically exact, asymptotically biased, and even deterministic sample sequences. The quality 
measure is based on Stein’s method and is attainable by solving a linear program. After outlining 
our design criteria in Sectionl^we relate the convergence of the quality measure to that of standard 
probability metrics in Section^ develop a streamlined implementation based on geometric spanners 
in Section and illustrate applications to hyperparameter selection, convergence rate assessment, 
and the quantification of bias-variance tradeoffs in posterior inference in Section We discuss 
related work in Sectionj^and defer all proofs to the appendix. 

Notation We denote the £ 2 , £ 1 , and £00 norms on by 11-112, || • || and || • ||^^ respectively. We will 
often refer to a generic norm H-H on with associated dual norms ||ru||* = sup„gRd.||„||_i {w,v) 
for vectors w G ||M||* = sup^gRd.||„||^i for matrices M G and ||r|l* = 

sup„gRd.||„||_]^ ||T[u] II* for tensors T G We denote the j-th standard basis vector by e^-, the 

partial derivative by Vfc, and the gradient of any M'^-valued function g by Vg with components 


i'^9{x))jk = Vkgj{,x). 
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2 Quality Measures for Samples 


Consider a target distribution P with open convex support A’ C and continuously differentiable 
density p. We assume that p is known up to its normalizing constant and that exact integration under 
P is intractable for most functions of interest. We will approximate expectations under P with the 
aid of a weighted sample, a collection of distinct sample points xi,... ,Xn S X with weights q{xi) 
encoded in a probability mass function q. The probability mass function q induces a discrete distri¬ 
bution Q and an approximation Eg \h{Xy\ = <l{xi)h{xi) for any target expectation Ep 
We make no assumption about the provenance of the sample points; they may arise as random draws 
from a Markov chain or even be deterministically selected. 

Our goal is to compare the fidelity of different samples approximating a common target distribution. 
That is, we seek to quantify the discrepancy between Eg and Ep in a manner that (i) detects when 
a sequence of samples is converging to the target, (ii) detects when a sequence of samples is not 
converging to the target, and (iii) is computationally feasible. We begin by considering the maximum 
deviation between sample and target expectations over a class of real-valued test functions %, 

dn{Q,P) = sup |Eg[/i(X)] - Ep[h{Z)]\. (1) 

h^n 

When the class of test functions is sufficiently large, the convergence of d'H{Qm, P) to zero implies 
that the sequence of sample measures {Qm)m>i converges weakly to P. In this case, the expression 
Q is termed an integral probability metric (1PM) Q. By varying the class of test functions H, we 
can recover many well-known probability metrics as IPMs, including the total variation distance, 
generated by H = {/i : A" —)■ K | sup 2 :g;t^ \h{x) \ < 1}, and the Wasserstein distance (also known as 
the Kantorovich-Rubenstein or earth mover’s distance), dwn n > generated by 

n = W||.|| 4 {h : A- ^ K I sup,^,,;, < 1}. 

The primary impediment to adopting an IPM as a sample quality measure is that exact computation 
is typically infeasible when generic integration under P is intractable. However, we could skirt this 
intractability by focusing on classes of test functions with known expectation under P. For example, 
if we consider only test functions h for which Ep[h{Z)] — 0, then the IPM value d'p{Q, P) is the 
solution of an optimization problem depending on Q alone. This, at a high level, is our strategy, 
but many questions remain. How do we select the class of test functions hi How do we know that 
the resulting IPM will track convergence and non-convergence of a sample sequence (Desiderata 
(i) and (ii))? How do we solve the resulting optimization problem in practice (Desideratum (iii))? 
To address the hrst two of these questions, we draw upon tools from Charles Stein’s method of 
characterizing distributional convergence. We return to the third question in Sectionj^ 

3 Stein’s Method 

Stein’s method jTl for characterizing convergence in distribution classically proceeds in three steps: 

1. Identify a real-valued operator T acting on a set Q of K‘^-valuec0functions of X for which 

Ep[{Tg){Z)]=Q for all g&g. (2) 

Together, T and Q dehne the Stein discrepancy, 

S{Q,T,g)^snp\EQ[{Tg){X)]\=snv\EQ[{Tg){X)]-Ep[{Tg){Z)]\ = drQ{Q,P), 

g&G g&O 

an IPM-type quality measure with no explicit integration under P. 

2. Lower bound the Stein discrepancy by a familiar convergence-determining IPM dp. This 

step can be performed once, in advance, for large classes of target distributions and ensures 
that, for any sequence of probability measures S{fim,P,g) converges to zero 

only if dp{p,rn, P) does (Desideratum (ii)). 

3. Upper bound the Stein discrepancy by any means necessary to demonstrate convergence to 
zero under suitable conditions (Desideratum (i)). In our case, the universal bound estab¬ 
lished in Section l33] will suffice. 

'Scalar functions g are more common in Stein’s method, but we will find R''-valued g more convenient. 
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While Stein’s method is typically employed as an analytical tool, we view the Stein discrepancy as 
a promising candidate for a practical sample quality measure. Indeed, in Section]^ we will adopt an 
optimization perspective and develop efficient procedures to compute the Stein discrepancy for any 
sample measure Q and appropriate choices of T and Q. First, we assess the convergence properties 
of an equivalent Stein discrepancy in the subsections to follow. 


3.1 Identifying a Stein Operator 


The generator method of Barbour El provides a convenient and general means of constructing op¬ 
erators T which produce mean-zero functions under P ^ . Let {Zt)t>Q represent a Markov process 
with unique stationary distribution P. Then the infinitesimal generator A of (Z()t>o, defined by 

(.Au)(x) = lini {E[u{Zt) \ Zq = x] — u{x))/t for u : ^ R, 

satisfies Ep[{Au){Z)] = 0 under mild conditions on A and u. Hence, a candidate operator T can 
be constructed from any infinitesimal generator. 


For example, the overdamped Langevin dijfusion, defined by the stochastic differential equation 
dZt = IV \ogp{Zt)dt + dWt for {Wt)t>o a Wiener process, gives rise to the generator 

{Apu){x) = Vlogp(a;)) -f i(V,Vu(a:)). (3) 

After substituting g for ^ Vm, we obtain the associated Stein operato^ 

{Tpgfix) = {g{x),V\ogp{x)) + (V,g(x)). (4) 


The Stein operator 7p is particularly well-suited to our setting as it depends on P only through the 
derivative of its log density and hence is computable even when the normalizing constant of p is not. 


If we let dX denote the boundary of X (an empty set when X = R'^) and n{x) represent the outward 
unit normal vector to the boundary at x, then we may define the classical Stein set 


G\\w = {9--x^: 


sup max 


iig(x)iiMjvg(x)ir 


|Vg(x)-Vg(y)H 

\\x-y\\ 


< 1 and 


{g{x),n{x)) 


0,\/x € dX with n(x) defined 


of sufficiently smooth functions satisfying a Neumann-type boundary condition. The following 
proposition - a consequence of integration by parts - shows that C/||.|| is a suitable domain for Tp. 

Proposition 1. //'Ep[|| V logp(Z)||] < oo, then Ep[(7pg)(Z)] = 0 for all g € ^||.||. 


Together, Tp and C/||.|| form the classical Stein discrepancy S{Q,Tp, (/n-n). our chief object of study. 


3.2 Lower Bounding the Classical Stein Discrepancy 

In the univariate setting (d = 1), it is known for a wide variety of targets P that the classical Stein 
discrepancy S{p,rn,Tp, G\\-\\ ) converges to zero only if the Wasserstein distance dw^^ {^m, P) does 
[? So). In the multivariate setting, analogous statements are available for multivariate Gaussian 
targets nmniiin], but few other target distributions have been analyzed. To extend the reach of the 
multivariate literature, we show in Theorem that the classical Stein discrepancy also determines 
Wasserstein convergence for a large class of strongly log-concave densities, including the Bayesian 
logistic regression posterior under Gaussian priors. 

Theorem 2 (Stein Discrepancy Lower Bound for Strongly Log-concave Densities). If X = and 
logp is strongly concave with third and fourth derivatives bounded and continuous, then, for any 
probability measures (gm)m>i, S{grn,fp, ^/IHI ) 0 only ifdw^^.^^ (Pm, P) 0. 

We emphasize that the sufficient conditions in Theorem are certainly not necessary for lower 
bounding the classical Stein discrepancy. We hope that the theorem and its proof will provide a tem¬ 
plate for lower bounding S{Q,Tp, G\\-\\ ) for other large classes of multivariate target distributions. 

^The operator Tp has also found fruitful application in the design of Monte Carlo control variates O. 
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3.3 Upper Bounding the Classical Stein Discrepancy 


We next establish sufficient conditions for the convergence of the classical Stein discrepancy to zero. 
Proposition 3 (Stein Discrepancy Upper Bound). If X ^ Q and Z ^ P with V logp(^) integrable, 


S(Q,Tp, ^iHi) < ||/|| E[||X - Z||] + E[|| V logp(X) - V \ogp{Z)\\] + E[|| V logp(Z)(X - Zf 


< ||/||E[|iX-Z||]+E[||Vlogp(X)-Vlogp(Z)|l] +WE ||Vlogp(Z)r E ||X - Z|! 


One implication of Propositionis that S{Qm, Tp, f/||.||) converges to zero whenever Xm ~ Qm 
converges in mean-square to Z ~ P and V \ogp{Xm) converges in mean to V logp(Z). 


3.4 Extension to Non-uniform Stein Sets 


The analyses and algorithms in this paper readily accommodate non-uniform Stein sets of the form 

_ iivgwir ^ iivg(a^)-vgfa)ir ^ < ^ 






max 


(5) 


{g{x),n{x)) = 0,Va: S dX with n(x) dehned 

for constants ci, C 2 , C 3 > 0 known as Stein factors in the literature. We will exploit this additional 
flexibility in Section 5.2 to establish tight lower-bounding relations between the Stein discrepancy 
and Wasserstein distance for well-studied target distributions. For general use, however, we advocate 
the parameter-free classical Stein set and graph Stein sets to be introduced in the sequel. Indeed, any 
non-uniform Stein discrepancy is equivalent to the classical Stein discrepancy in a strong sense: 

Proposition 4 (Equivalence of Non-uniform Stein Discrepancies). For any ci, C 2 , C 3 > 0, 
min(ci,C 2 ,C 3 ) 5 (( 3 , 7 p,(/||.||) < S{Q,Tp,GnT) < ina.x{ci,C2,C3)S{Q,Tp,G\\.\\)- 


4 Computing Stein Discrepancies 


In this section, we introduce an efficiently computable Stein discrepancy with convergence prop 
erties equivalent to those o f the cla ssical discrepancy. We restrict attention to the unconstr ained 


domain A" = in Sections 4.l|4.3 and present extensions for constrained domains in Section 


4.4 


4.1 Graph Stein Discrepancies 

Evaluating a Stein discrepancy S{Q,Tp, G) for a fixed {Q, P) pair reduces to solving an optimiza¬ 
tion program over functions g G G- Eor example, the classical Stein discrepancy is the optimum 

SiQ,Tp,G\\.\\) = sup Qixi){{g{xi),X \ogp{xi)) + {V,g{xi))) (6) 
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s.t. ||5(a:)ir < l,||V 5 (x)|r < I, ||V 5 (a:) - V 5 (y)||* < \\x - y\\,^x,y G A”. 

Note that the objective associated with any Stein discrepancy S{Q, Tp, G) is linear in g and, since 
Q is discrete, only depends on g and Vp through their values at each of the n sample points Xi. The 
primary difficulty in solving the classical Stein program stems from the infinitude of constraints 
imposed by the classical Stein set ^||.j|. One way to avoid this difficulty is to impose the classical 
smoothness constraints at only a finite collection of points. To this end, for each finite graph G = 
(y, E) with vertices U C A" and edges E C we define the graph Stein set, 

0|M|,Q.G = l^g ■■ X \ y X gV, max(|| 5 (a;)|r, |jV 5 (a:)|r) < 1 and, V(x,?/) G E, 


(\\gG)-g(y)r 

||Vs(x)-Vg(!/)|r 

l|g(a;)-g(i/)-Vg(a;)(a;-y)||* 

l|g(a;)-g(g)-Vg(y)(a;-y)||* \ ^ ,\ 

{ lk-g|l ’ 

lk-y|| ’ 


hw^-vr / ^ J 


the family of functions which satisfy the classical constraints and certain implied Taylor compati¬ 
bility constraints at pairs of points in E. Remarkably, if the graph Gi consists of edges between all 
distinct sample points Xi, then the associated complete graph Stein discrepancy S{Q,Tp, (/||.|| ,q,Gi) 
is equivalent to the classical Stein discrepancy in the following strong sense. 
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Proposition 5 (Equivalence of Classical and Complete Graph Stein Discrepancies). IfX = and 
Gi = {supp{Q),Ei) with Ei = {{xi,xi) G supp(Q)^ : Xi ^ xi], then 

S{Q,Tp,Q\\.\\) < 5((5,Tp,^||.||,q,gJ < KdS{Q,Tp,Q\\.\\), 

where Kd is a constant, independent of {Q, P), depending only on the dimension d and norm || • ||. 

Proposition 1^ follows from the Whitney-Glaeser extension theorem for smooth functions llT4l[T5ll 
and implies that the complete graph Stein discrepancy inherits all of the desirable convergence prop¬ 
erties of the classical discrepancy. However, the complete graph also introduces order nf constraints, 
rendering computation infeasible for large samples. To achieve the same form of equivalence while 
enforcing only Oin) constraints, we will make use of sparse geometric spanner subgraphs. 


4.2 Geometric Spanners 

For a given dilation factor f > 1, a t-spanner iniini is a graph G = {V, E) with weight ||a; — y\\ 
on each edge {x,y) G E and a path between each pair x' y' G V with total weight no larger 
than f||a;' — j/'||. The next proposition shows that spanner Stein discrepancies enjoy the same con¬ 
vergence properties as the complete graph Stein discrepancy. 

Proposition 6 (Equivalence of Spanner and Complete Graph Stein Discrepancies). If X = 

Gt = {svlPp{Q),E) is a t-spanner, and Gi = (supp((5), a;;) G supp(Q)^ : Xi xi\), then 

Moreover, for any Ip norm, a 2-spanner with 0{Kdn) edges can be computed in 0{Kdn^og{n)) 
expected time for Kd a constant depending only on d and ||- II M- As a result, we will adopt a 
2-spanner Stein discrepancy, S{Q, Tp, ^ii-n ,q,G 2 )’ o™ standard quality measure. 


4.3 Decoupled Linear Programs 


The final unspecified component of our Stein discrepancy is the choice of norm 11 • 11. We recommend 
the £i norm, as the resulting optimization problem decouples into d independent finite-dimensional 
linear programs (LPs) that can be solved in parallel. More precisely, S{Q, 7p, ^||.||^,Q,(y,£;)) equals 

sup Y.[=[QMilp'^j^ogp{vi)+Tjj,) (7) 


s.t. hjlL ^ 1-llElL ^ 1> andVz ^ ( : {Vi,vi) G E, 


IJWoo 

max 


f hji-ijil 


llrj(ei-ei) 




< 1 . 


We have arbitrarily numbered the elements Vi of the vertex set V so that represents the function 
value gj{vi), and Tjki represents the gradient value Vkgj{vi). 


4.4 Constrained Domains 


A small modification to the unconstrained formulation Q extends our tractable Stein discrepancy 
computation to any domain defined by coordinate boundary constraints, that is, to A' = (ai, ^i) x 
• • • X {ad, Pd) with —oo < aj < fUj < oo for all j. Specifically, for each dimension j, we augment 
the j-th coordinate linear program of Q with the boundary compatibility constraints 


max 


hjd 






< 1, for each i, bj G {a^, /Sj} n K, and k ^ j. (8) 


These additional constraints ensure that our candidate function and gradient values can be extended 


to a smooth function satisfying the boundary conditions {g{z),n{z)) = 0 on dX. Proposition 10 
in the appendix shows that the spanner Stein discrepancy so computed is strongly equivalent to me 
classical Stein discrepancy on X. 


Algorithm summarizes the complete solution for computing our recommended, parameter-free 
spanner Stein discrepancy in the multivariate setting. Notably, the spanner step is unnecessary in the 
univariate setting, as the complete graph Stein discrepancy S{Q, 7p, G\\-\\^,q,Gi ) can be computed 
directly by sorting the sample and boundary points and only enforcing constraints between consecu¬ 
tive points in this ordering. Thus, the complete graph Stein discrepancy is our recommended quality 
measure when d = 1, and a recipe for its computation is given in Algorithm]^ 
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Algorithm 1 Multivariate Spanner Stein Discrepancy 

input: Q, coordinate bounds (ai, /3i),..., {ad, Pd) with —oo < aj < Pj < oo for all j 
G 2 ^ Compute sparse 2-spanner of supp(Q) 

for j = 1 to d do (in parallel) 

Tj ^ Solve j-th coordinate linear program Q with graph G 2 and boundary constraints 
return J2‘j=i ^3 


Algorithm 2 Univariate Complete Graph Stein Discrepancy 

input: Q, bounds (a, P) with —00 < a < P < 00 
... ,a;(„/)) ^ SORT({a:i,...,a;„, a,/?} n M) 

return sup^gR„/_rgR„^ XlHi Qix{i)){li£^ogp{x^,)) + r^) 

s-t. |jr||^ < l,Vi < n',\ji\ < l[a < X(^i) < P], and, Vi < v!, 

l7.-7i+ll Irj-U+il |7i-7i+l-ri(3:(i)-a:(i+i))| |7.-7i+l-Fi+i (3:(i))| \ 


5 Experiments 

We now turn to an empirical evaluation of our proposed quality measures. We compute all spanners 
using the efficient C-n- greedy spanner implementation of Bouts et al. nsi and solve all optimization 
programs using Julia for Mathematical Programming ll20l with the default Gurobi 6.0.4 solver ET]. 
All reported timings are obtained using a single core of an Intel Xeon CPU E5-2650 v2 @ 2.60GHz. 

5.1 A Simple Example 

We begin with a simple example to illuminate a few properties of the Stein diagnostic. For the target 
P = Af(0,1), we generate a sequence of sample points i.i.d. from the target and a second sequence 
i.i.d. from a scaled Student’s t distribution with matching variance and 10 degrees of freedom. The 
left panel of Figure[T]shows that the complete graph Stein discrepancy applied to the first n Gaussian 
sample points decays to zero at an rate, while the discrepancy applied to the scaled Student’s 

t sample remains bounded away from zero. The middle panel displays optimal Stein functions g 
recovered by the Stein program for different sample sizes. Fach g yields a test function h = Tpg, 
featured in the right panel, that best discriminates the sample Q from the target P. Notably, the 
Student’s t test functions exhibit relatively large magnitude values in the tails of the support. 

5.2 Comparing Discrepancies 

We show in Theorem]^ in the appendix that, when d = 1, the classical Stein discrepancy is the 
optimum of a convex quadratically constrained quadratic program with a linear objective, 0{n) 
variables, and 0{n) constraints. This offers the opportunity to directly compare the behavior of the 
graph and classical Stein discrepancies. We will also compare to the Wasserstein distance (iw|| |p 
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Figure 1; Feft; Complete graph Stein discrepancy for a Af(0,1) target. Middle / right; Optimal 
Stein functions g and discriminating test functions h = Tpg recovered by the Stein program. 
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Figure 2; Comparison of discrepancy measures for sample sequences drawn i.i.d. from their targets. 

which is computable for simple univariate target distributions ll22l and provably lower bounds the 
non-uniform Stein discrepancies with Ci-^ = (0.5,0.5,1) for P = Unif(0,1) and ci :3 = (1,4, 2) 
for P = A/^(0,1) [? 123. For A/^(0( 1) and Unif(0,1) targets and several random number generator 
seeds, we generate a sequence of sample points i.i.d. from the target distribution and plot the non- 
uniform classical and complete graph Stein discrepancies and the Wasserstein distance as functions 
of the first n sample points in Figure Two apparent trends are that the graph Stein discrepancy 
very closely approximates the classical and that both Stein discrepancies track the fluctuations in 
Wasserstein distance even when a magnitude separation exists. In the Unif(0,1) case, the Wasser¬ 
stein distance in fact equals the classical Stein discrepancy because Tpg = p' is a Lipschitz function. 

5.3 Selecting Sampler Hyperparameters 

Stochastic Gradient Langevin Dynamics (SOLD) 111 with constant step size e is a biased MCMC 
procedure designed for scalable inference. It approximates the overdamped Langevin diffusion, 
but, because no Metropolis-Hastings (MH) correction is used, the stationary distribution of SOLD 
deviates increasingly from its target as e grows. If e is too small, however, SOLD explores the sample 
space too slowly. Hence, an appropriate choice of e is critical for accurate posterior inference. To 
illustrate the value of the Stein diagnostic for this task, we adopt the bimodal Gaussian mixture 
model (GMM) posterior of |[3l as our target. For a range of step sizes e, we use SGLD with minibatch 
size 5 to draw 50 independent sequences of length n = 1000, and we select the value of e with the 
highest median quality - either the maximum effective sample size (ESS, a standard diagnostic based 
on autocorrelation HI) or the minimum spanner Stein discrepancy - across these sequences. The 
average discrepancy computation consumes 0.4s for spanner construction and 1.4s per coordinate 
linear program. As seen in Figure]^ ESS, which does not detect distributional bias, selects the 
largest step size presented to it, while the Stein discrepancy prefers an intermediate value. The 
rightmost plot of Eigure[3b| shows that a representative SGLD sample of size n using the e selected 
by ESS is greatly overdispersed; the leftmost is greatly underdispersed due to slow mixing. The 
middle sample, with e selected by the Stein diagnostic, most closely resembles the true posterior. 

5.4 Quantifying a Bias-Variance Trade-off 

The approximate random walk MH (ARWMH) sampler Q is a second biased MCMC procedure 
designed for scalable posterior inference. Its tolerance parameter e controls the number of datapoint 
likelihood evaluations used to approximate the standard MH correction step. Qualitatively, a larger e 
implies fewer likelihood computations, more rapid sampling, and a more rapid reduction of variance. 
A smaller e yields a closer approximation to the MH correction and less bias in the sampler stationary 
distribution. We will use the Stein discrepancy to explicitly quantify this bias-variance trade-off. 

We analyze a dataset of 53 prostate cancer patients with six binary predictors and a binary outcome 
indicating whether cancer has spread to surrounding lymph nodes Il24ll . Our target is the Bayesian 
logistic regression posterior IT] under a JV{0,1) prior on the parameters. We run RWMH (e = 0) 
and ARWMH (e = 0.1 and batch size = 2) for 10® likelihood evaluations, discard the points from 
the first 10® evaluations, and thin the remaining points to sequences of length 1000. The discrepancy 
computation time for 1000 points averages 1.3s for the spanner and 12s for a coordinate LR Eigure|^ 
displays the spanner Stein discrepancy applied to the first n points in each sequence as a function of 
the likelihood evaluation count. We see that the approximate sample is of higher Stein quality for 
smaller computational budgets but is eventually overtaken by the asymptotically exact sequence. 


—^ Classical Stein 
- —• Wasserstein 

Complete graph Stein 
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(a) Step size selection criteria (b) 1000 SOLD sample points with equidensity contours of p overlaid 
Figure 3; ([^ ESS maximized at e = 5 x 10“^; Stein discrepancy minimized at e = 5 x 10“^. 
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Figure 4; Bias-variance trade-off curves for Bayesian logistic regression with approximate RWMH. 


To corroborate our result, we use a Metropolis-adjusted Langevin chain ll25l of length 10^ as a surro¬ 
gate Q* for the target and compute several error measures for each sample Q: normalized probability 
error max/ |E[cr((X,ru/)) - a{{Z,wi))]\/\\wi\\^, mean error second moment 

error ^ X Q, Z Q*, a(t) = and wi the Z-th datapoint covariate 

vector. The measures, also found in Figure]^ accord with the Stein discrepancy quantification. 


5.5 Assessing Convergence Rates 

The Stein discrepancy can also be used to assess the quality of deterministic sample sequences. In 
Figure|^in the appendix, for P = Unif(0,1), we plot the complete graph Stein discrepancies of the 
first n points of an i.i.d. Unif(0,1) sample, a deterministic Sobol sequence ll26l . and a deterministic 
kernel herding sequence lIZTl defined by the norm \\h\\^ = fg(h'(x))^dx. We use the median value 
over 50 sequences in the i.i.d. case and estimate the convergence rate for each sampler using the 
slope of the best least squares affine fit to each log-log plot. The discrepancy computation time 
averages 0.08s for n — 200 points, and the recovered rates of and n~^ for the i.i.d. and 

Sobol sequences accord with expected 0{l/y/n) and 0{l/n) bounds from the literature ESlI^ . 
As witnessed also in other metrics IMIl . the herding rate of outpaces its best known bound of 

dpi{Qn, P) = 0(l/v/n). suggesting an opportunity for sharper analysis. 


6 Discussion of Related Work 


We have developed a quality measure suitable for comparing biased, exact, and deterministic sample 
sequences by exploiting an infinite class of known target functionals. The diagnostics of 131113^ 
also account for asymptotic bias but lose discriminating power by considering only a finite collec¬ 
tion of functionals. For example, for a A/'(0,1) target, the score statistic of 1i32\ cannot distinguish 
two samples with equal first and second moments. Maximum mean discrepancy (MMD) on a char¬ 
acteristic Hilbert space ll3^ takes full distributional bias into account but is only viable when the 
expected kernel evaluations are easily computed under the target. One can approximate MMD, but 
this requires access to a separate trustworthy ground-truth sample from the target. 
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Figure 5: Comparison of complete graph Stein discrepancy convergence for P = Unif(0,1). 

A Proof of Proposition 


Our integrability assumption together with the boundedness of g and Vp imply that Ep[(V, g{Z))] 
and Ep[(p(Z), V logp(Z))] exist. Define the ^oo ball of radius r, Br = {x ^ ||a:||^ < 

r}. Since X is convex, the intersection A” n , 6 ^ is compact and convex with Lipschitz boundary 
d{X n ;Br). Thus, the divergence theorem (integration by parts) implies that 

Ep[(rp5)(Z)] =Ep[(V,5(Z)) + ( 5 (Z),Vlogp(Z))] = [ {V,p{z)g{z)) dz 

Jx 

= lim / {V,p{z)g{z)) dz = lim / {g{z),nr{z))p{z) dz 

J xnBr- Jd{xr\Br-) 

for rir the outward unit normal vector to d{X n Br)- The final quantity in this expression equates to 
zero, as {g{x), n{x)) = 0 for all x on the boundary dX, g is bounded, and limm_>oo pixm) = 0 for 
any {xm)m=i with Xm ^ X for all m and ||xm||oo oo- 


B Proof of Theorem 1^ Stein Discrepancy Lower Bound for Strongly 
Log-concave Densities 


We let {X ) denote the set of real-valued functions on X with k continuous derivatives and dj^ n 

denote the smooth function distance, the IPM generated by 


Af||.|i ^ {/i e c3(A’) 


suPxGA’ max^ll V/i(a;)||*, || ||*, || ||*^ < l|. 


We additionally define the operator norms ||u||j,p = ||'(;||2 for vectors v G ||AF||qp = 

stip„gRd:||„ll^=i ||Mu ||2 for matrices M G , and ||r||^p = Il'THLp for ten¬ 

sors T G 


The following result, proved in the companion paper 1^ , establishes the existence of explicit con¬ 
stants {Stein factors) ci, C 2 , C 3 > 0, such that, for any test function h G -Adn ip the Stein equation 

h{x)-Rp[h{Z)] = {rpgh){x) 

has a solution g^ = ^S7uh belonging to the non-uniform Stein set 

Theorem 7 (Stein Factors for Strongly Log-concave Densities ll34l Theorem 2.1]). Suppose that 
X = and that \ogp G C"^(A’) is k-strongly concave with 

sup||V^logp(z)|| <L 3 and suplllogp(z)|| < L 4 . 
zex ^ z&x ^ 

For each x G X, let {Zt^x)t>o represent the overdamped Langevin diffusion with infinitestimal 
generator 

iAu){x) = ^{Vu{x),V \ogp{x)) -f i(V,Vu(x)) (9) 

and initial state Zq ^ = x. Then, for each h G C^{X) with bounded first, second, and third 
derivatives, the function 

pOO 

Uh{x)^ / Ep[hiZ)]-E[h{Zt,x)]dt 
Jo 
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solves the the Stein equation 


and satisfies 


h{x) - Ep[h{Z)] = {Auh)ix) 


( 10 ) 


sup||Vu/,(z)||2 < 7 sup||V/i(z)||2, 


zex 


2 

k zex 

2L 


1 


sup||V <-p-sup||V/i(z )||2 + 7 sup||v h{z)\\^^, and 


zGX 


sup 


\'Sj'^Uh{z) - \l‘^uh{y)\ 

h-y\\2 


< 


6Li 


zGX 

SUp||V/l(z )||2 

zGX 


k z^x 

La 


SUp||V/l(z)||2 


3L 


+ -ffT sup||v^/i(^) 


zGA’ 




Hence, by the equivalence of non-uniform Stein discrepancies (Proposition d^n n (/r, P) < 
5(/x,Tp,0y(|p < max(ci, C 2 , cz)S{y,,Tp,Q\\.\\) for any probability measure p. 

The desired result now follows from Lemma which implies that the Wasserstein distance 
t(W||.|| ih'rmP) 0 whenever II II (/imjP) —)• 0 for a sequence of probability measures {nm)m>i- 

Lemma 8 (Smooth-Wasserstein Inequality). If y, and v are probability measures on and ||u|| > 
||u|| 2 /or a//u S then 

c(a^||.|| < dW||,||(d,i^) < 3max^dAt||.|| ^ (m, J^)v^E[||G||]^ 

for G a standard normal random vector in 

Lemma 2.2 of the companion paper 041 establishes this result for the case H-H = 11 - 112 ; we omit the 
proof of the generalization which closely mirrors that of the Euclidean norm case. 


C Proof of Proposition]^ Stein Discrepancy Upper Bound 

Fix any g in f/||.j[. By Proposition]^ E[{7pg){Z)] = 0. The Lipschitz and boundedness constraints 
on g and V g now yield 

EQ[(rpp)(X)] = E[(rp 5 )(X) - {Tpg){Z)] 

= E[( 5 (X), V \ 0 gp{X)) - (giZ), V logp(Z)) + (V, p(X) - g{Z))] 

= E[(5(^): V \ogp{X) - V logp(Z)) -f {g{X) - g{Z), V \ogp{Z))] 

+ E[{V,g{X)-g{Z))] 

< E[||Vlogp(X) - Vlogp(Z)||] + E[||Vlogp(Z)(X - Z)T||] + ||/||E[||X - Z||]. 


To derive the second advertised inequality, we use the definition of the matrix norm, the Fenchel- 
Young inequality for dual norms, the definition of the matrix dual norm, and the Cauchy-Schwarz 
inequality in turn; 


E[||Vlogp(Z)(X-Z)T||] =E 


< E 


sup {V logp{Z), M{X — Z)) 

M:||M ||*=1 


sup \\V\ogp{Z)\\\\M{X-Z)\\^ 


<E[||Vlogp(Z)||||X-Z||]< WE ||Vlogp(Z)f E ||X-Zf 


Since our bounds hold uniformly for all g in C/||.||, the proof is complete. 
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D Proof of Proposition Equivalence of Non-uniform Stein Discrepancies 

Fix any ci, C2, C3 > 0, and let Cmax = max(ci, C2, C3) and Cmin = inin(ci, C2, C3). Since the Stein 
discrepancy objective is linear in g, we have aS{Q^Tp,G\\.\\) = 7p,a0||.||) for any a > 0. 

The result now follows from the observation that CniinS|| || ^ Q Cmax0|| || • 

E Proof of Proposition]^ Equivalence of Classical and Complete Graph 
Stein Discrepancies 

The hrst inequality follows from the fact that ^||.|| C ^||.||_q_Gi- By the Whitney-Glaeser extension 
theorem ifTSl Thm. 1.4] of Glaeser iflTl . for every function g € G\\.\\^q^Gi’ there exists a function 
g & Kii ^ 1 * II with g{xi) = g{xi) and Vg{xi) = Vg{xi) for all Xi in the support of Q. Here Kd is a 
constant, independent of {Q,P), depending only on the dimension d and norm H lj. Since the Stein 
discrepancy objective is linear in g and depends on g only through the values g{xi) and Vg{xi), we 

have 5 (Q,Tp,0||.||,q,Gi) ^ S{Q,Tp,KdG\y\\) = HdS{Q,Tp,G\y\\)- 


F Proof of Proposition]^ Equivalence of Spanner and Complete Graph 
Stein Discrepancies 


The first inequality follows from the fact that 0 ||.||,q,Gi ^ G\\-\\,Q,Gt- Ftx any g G G\\-\\,Q,Gt ™tl any 
pair of points z, z' G supp(Q). By the dehnition of ^||.|| ,Q.Gt,wehavemax(||g(z)|r, ||V5(z)|r) < 
1. By the f-spanner property, there exists a sequence of points zq, zi, Z 2 ,..., Zp-i, zp G supp(Q) 
with zo = z and zp = z' for which (zi-i,zi) € i? for all 1 < Z < L and Ez=ill zi-i - zi\\ < 
t\\zo — zp\\. Since max^^ triangle inequal¬ 

ity implies that 


||V5(zo) - Vg{zp)\\* < ^||V 5 (z;_i) - V5r(z;)||* < ^||zi_i - Zzll < f||zo - Zp\\. 


1=1 


1=1 


Identical reasoning establishes that \\g{zo) — p(zl)||* < f||zo —-Zi||. 


Furthermore, since ||p(z/_i) — g{zi) — V(?(z;)(z/_i — zi)\\* < ||| 2 :z-i — zi\\'^ for each I, the trian¬ 
gle inequality and the dehnition of the tensor norm || -jl* imply that 

|| 5 (zo) - g{zp) - Vg{zp){zo - zp)\\* 

L 

< 'YJ^9{zi-i) - g{zi) - Vg{zi){zi-i - zz)!!* -f ||(Vg(zz) - Vg{zp)){zi-i - zz)]!* 

Z =1 


- ^ W^aizi) - Vp(zL)|r||zz_i - zzll 

Z =1 ^ 


- 51 “ ^'11^ + '^\\'^9{zi') - V5(zz-+i)iri|zz-i - zzll 

Z=1 ^ Z'=Z 


< - ^/ll ( 


L -1 


^zii + 5^11^'' 

Z'=Z 



< 



<t^zo 




Since z, z' were arbitrary, and the Stein discrepancy objective is linear in g, we conclude that 

S{Q,Tp,G\\.\\,Q,Gt) ^ S{Q,Tp,2t'^G\\.\\,Q,Gi) = 2f^5(Q,Tp,0||.||,Q,Gi)- 


G Finite-dimensional Classical Stein Program 

Theorem 9 (Finite-dimensional Classical Stein Program). If X = (a, (3) for —00 < a < /3 < 00 , 
and a;(i) < • • • < X(„/) represent the sorted values of {xi ,..., a, /3} n M, then the non-uniform 
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classical Stein discrepancy S{Q^ Tp, optimal value of the convex program 

inax J2'^=iq{x(^i))-£^\ogp{x(^i))g{x(^i)) + q{x(^,))g'{x(^i)) (11a) 

s.t. Vj e {!,... ,n'- 1}, \g\x(^i))\ < C 2 , |ff(a;(i+i)) - g(a;(i))| < C 2 (x(i+i) - a:(i)), (11b) 

9ix(z)) - 9{xi^+l)) + ^ {9'ix{i)) - g'(x(,+i)))^ + (^'(a^b)) + 5'(a;(i+i))) 

+ —(11c) 

C3 4 


5(a:(,+i)) - g{x(^,)) + “ ff'(a:b+i)))^ “ 


(/(a^w) + 5 '{a:(i+i))) 


“b ^ i,^u')p f: ^ (a'(2+l) a^(2))* ; 

Vi e {!,..., n'},|g(a;(,))| < l[a < X(,-, < /?] (ci - :^9'(x(i)f) 

ZC 3 


(lid) 

(lie) 


where {r)+ = max(r, 0), 

Lb = ^(x(i+i) - x^i)) - i(g'(a;(i)) + g'{x(^i+i))) - C 2 , and 

Lu = f (*b+i) “ ^(i)) + H9'i^{i)) + V'(a;(i+i))) - C2- 


We say the program GB is finite-dimensional, because it suffices to optimize over vectors 7 , T S 
K"’ representing the function values ( 7 ^ = p(x(i))) and derivative values (F^ = gfx^i'^)) at each 
sample or boundary point X(i). Indeed, by introducing slack variables, this program is representable 
as a convex quadratically constrained quadratic program with 0{n) constraints, 0{n) variables, and 
a linear objective. Moreover, the pairwise constraints in this program are only enforced between 
neighboring points in the sequence of ordered locations xi^iy Hence the resulting constraint matrix 
is sparse and banded, making the problem particularly amenable to efficient optimization. 

Proof Throughout, we say that g is an extension of g if g(x(i)) = g{x(^y) and g'{xy^) = g'{xy^) 
for each x^y G supp(Q). Since the Stein objective only depends on g and g' through their values at 
sample points, g and any extension g have identical objective values. 

We will establish our result by showing that every g G feasible for the program (|TT|), so 

that S{Q,7p, ^^yyi^) lower bounds the optimum of o, and that every feasible g for 
extension in p € Sy.^jp so that iS(( 5 , Tp, Syyi^) ^Iso upper bounds the optimum of GB- 


TBTias 


0 


as an 


G.l Feasibility of 


Fix any g G ^y.^y^- Also, since g' is C 2 -bounded and ca-Lipschitz, the constraints ( |1 lb| l must be 
satisfied. Consider now the C 2 -bounded and Ca-Lipschitz extensions of g' 

i3(f) = max(—C 2 , max \g'(xp)) — cff — xu\'\\) and 
C/(f) = min(c 2 , min [p'(a;(i))-f ca|t - a:(i) |]). 

l< 2 <n' 


We know that B{f) < g'{f) < U{t) for all t, for, if not, there would be a point fg and a point xy) 
such that \g'{x(i)) — g'{tQ)\ > Ca|a;(i) — fol^ which combined with the Ca-Lipschitz property would 
be a contradiction. Thus, for each sample xy), the fundamental theorem of calculus gives 

g{xy+i))-g{xy))= g'{t)dt> 


The right-hand side of this inequality evaluates precisely to the right-hand side of the constraint 
( fTTcl i. An analogous upper bound using U(t) yields ( fTTdl ). 


Finally, consider any point xyy If xy) G {a, /?}, then (1 le 1 is satisfied as g(z) = 0 for any point 
z on the boundary. Suppose instead that a < X(i) < /3. Without loss of generality, we may assume 
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that > 0. Since 5 'is Ca-Lipschitz, we have (ji'(t) > p'(a;(j)) —Caji—foralH. Integrating 

both sides of this inequality from to Xu = + g'/c^, we obtain 

pXu pXu 

9 {xu) - gix(^i)) = g'{t)dt> g'{x(^)-C 3 {t-X(i))dt = g'{x(^f‘/{2cy,) 

d^{i) 

Since g{xu) < ci, we have + 9ix(i)) < ci. Similarly, by integrating the inequality 

from Xb = X{i) — 9 '{x( i))/c z to X(i), we have g{xb) — 9 {x(i)) > (/'(a:(i))^/( 2 ca), which combined 
with g{xb) < Cl yields (1 le 1 . 


G.2 Extending Feasible Solutions 

Suppose now that g is any function feasible for the program O- We will construct an extension 
g g by first working independently over each interval (a;(i), a;(i+i)). Fix an index i < n'. Our 
strategy is to identify a pair of C 2 -bounded, ca-Lipschitz functions mi and Mi defined on the interval 

[a;(i),a;(i+i)] which satisfy mi{x) < Mi{x) for all x G [a:(i), a:(i+i)], mi{x) = Mi{x) = g'{x) for 
X e {a;(i),a;(i+i)}, and mi{t)dt < 5 (a:(,+i)) - g{x(^,)) < Mi{t)dt. For any such 

{mi, Mi) pair, there exists (i G [ 0 , 1 ] satisfying 


g{x^,+i)) - g{x(^i)) = 

and hence we will define the extension 

g{x) = g{x(^i)) + 


5(i + l) 


CiTOj(f) -f (1 - Q)Mi{t)dt, 


(i) 


Cim{t) -I- (1 - Ci)M^{t)dt. 


E(i) 


By convexity, the extension derivative g' is C 2 -bounded and ca-Lipschitz, so we will only need to 
check that sup^^gi^- |5(3^)I < ci. The max imum magnitude values of g occur either at the interval 
endpoints, which are ci-bounded by (1 le I, or at critical points x satisfying g'{x) = 0, so it suffices 
to ensure that g is ci-bounded at all critical points. 

We will use the C 2 -bounded, Ca-Lipschitz functions B and U as building blocks for our extension, 
since they satisfy B{t) = U{t) = g'{t) forf G and B{t) < g'{t) < U{t), 

B{t) = max(-C 2 , 5 '(a:(i)) - C3(f - X(i:)),g'{x(i+i)) - C3(a;(i+i) - t)), and 

U{t) = min(c 2 , 5 '(a;(i)) -f C 3 (f - X(^), g'{x(i+i)) + C 3 (a;(i+i) - t)), 

for t G X(j+i)]. We need only consider three cases. 

Case 1: B and U are never negative or never positive on [a:(i), a:(i+i)]. For this case, we will 
choose mi = B and Mi = f7. By (1 Ic 1 and (1 Idi we know mi{t)dt < g{x(^ij^i'^)—g{x(i'^) < 


Mi{t)dt. Since B and U never change signs, g will be monotonic and hence ci-bounded for 


GO 


any choice of C,i . 


Case 2: Exactly one of B and U changes sign on [x(i), X(i+i)]. Without loss of generality, we 
may assume that g'{x(i)),g'{x(i+i)) > 0 and that B changes sign. Consider the quantity ip = 
max{B{t),0}dt. If p(a;(i_|_i)) — g{x^i-^) < p, we let mi = B and Mi = max{B, 0}. 


Since, on the interval B is piecewise linear with at most two pieces that can take on the 

value 0, B has at most two roots within this interval. However, since B{x) is continuous, negative 
for some value of x, and nonnegative at a; G {x(i), X(i+i)}, we know B has at least two roots. Thus 
let ri < r 2 be the roots of B{x). For any choice of Ci, the convex combination Qmi -f (1 — C,i)Mi 
will be exactly B outside (ri, X 2 ). Moreover, if Ci ^ 0, then this combination will be less than 0 on 
(ri,X 2 ), and if Ci = 0, the combinatio n wi ll be 0 on the whole interval. Hence it suffices to only 
check the critical points xi and r 2 . By (1 le 1 , mi{r) = Mi{r) = B{r) G [—ci, ci] for r G {xi, r 2 }, 
and so g will be ci-bounded. 


If instead p(x(i_|_i)) — (/(x(i)) > C>, we can recycle the argument from Case 1 with mi =max{i?,0} 
and Mi = U and conclude that g is ci-bounded. 
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Case 3: Both B and U change sign on Without loss of generality, we may assume 

that > 0, < 0. Since continuously interpolates between and 

on X(i_|_i)], it must have a root r. Let Wi G be the point where B changes from 

one linear portion to another. Then because B is monotonic on each linear portion, the fact that 
B{wi) < B(x(i+i)) < 0 means that B cannot have a root between [u'i,X(i+i)] and hence has at 
most one root on [x(i), X(i+i)]. Hence r is the unique root of B. 


In a similar fashion, let us define s as the root of U, and since B{x) < U{x) for all x, we have 
s> r. Define 

(B{x) xe[x(i),r) 

W{x) = Iq X G [r, s] 
ic/(x) fG(s,y], 


and ip = W{t)dt. As in Case 2, we will consider two subcases. If g(x(i+i)) — 5 (x(j)) < ip. 


we will let rm = B and Mi = W. By (llei, rriiir) = Mi{r) = B{r) G [—ci,ci], and since this is 
the only critical point, g will be ci-bounded. 


For the other case, in which 5 (x(i+i)) — 5 (x(i)) > ip, we choose rrii = W and Mi = U. Then 
(1 le I imply that mi{s) = Mi{s) = U{s) G [—Ci, ci], and, since this is the only critical point, the 
extension is well-defined on (x(i), X(i_|_i)). 


Defining g ontside of the interval [xi,x„'] It only remains to define our extension g outside 
of the interval [xi,x„/] when either a or /? is infinite. Suppose a = —oo. We extend g to each 
X G (—oo, xi) using the construction 


g(x) = f I[tG ixi-\g'{xi)\/c3,xi)]{g\xi)-C3sigii{g'{xi))t)dt. 

J —OO 


This extension ensures that g' is C 2 -bounded and ca-Lipschitz. Moreover, the constraint (llei 
guarantees that |g(x)| < Ci. Analogous reasoning establishes an extension to (x„', oo). □ 


H Equivalence of Constrained Classical and Spanner Stein Discrepancies 

For P with support X = (ai, /3i) x • • • x (ad, pid) for —oo < aj < pdj < oo, Algorithmj^computes 
a Stein discrepancy based on the graph Stein set 

Q\\-\\^,Q,(v,e) = : -T -)► I Vx G L, G {1,... ,d} with A: ^ j, and 6 ^ G {q!j,/3j} OR, 

max(Hg(x)||^, ||Vg(x)||^, ^ and,V(x,g) G E, 

_i"ll9(a;)-9(y)IU l|V9(a:)-V3(y)||^ \\g{x)-g{y)-Vg{x)(x-y)\\^ \\gioi:)-giy)-Vgiy){x-y)\\^\ ^ 

’ u^^-vWi ’ p\\x-y\\i y* - f ’ 


Our next result shows that the graph Stein discrepancy based on a f-spanner is strongly equivalent 
to the classical Stein discrepancy. 

Proposition 10 (Equivalence of Constrained Classical and Spanner Stein Discrepancies). If X = 
(ai,pdi) X • • • X (ad, Pd), and Gt = (supp(Q), E) is a t-spanner, then 

S(Q,Tp,Q\\.\\^) < S(Q,Tp,Q\\.\\^^Q^Gt) ^ t^i^dS(Q,Tp,Q\\.\\^), 

where Kd is a constant, independent of (Q, P, Gt, f), depending only on the dimension d. 

Proof 


Establishing the first inequality Fix any g G ^||.||j, -z G supp(Q), and j, k G {1,... ,(i} with 
k j, and consider any g-th coordinate boundary projection point 

6 G {z -I- ej(aj - Zj), z + ej(l3j - zj)} C R'^. 
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Since b G dX, we must have {g{b),n{b)) = {g{b), Cj) = gj{b) = 0. Moreover, for each dimension 
k 7^ j, we have Vkgj{x) = 0, since otherwise, {g{b + 5ek): n{b + 5ek)) = gj{b + Sck) ^ 0 for 
some (5 G M and b + 5ek G dX by the continuity of V^j. 

The smoothness constraints of the classical Stein set now imply that 

l5j(^)l = W - ffi Wl < kj - \'^k9,{x)\ = |Vfe5j(z) - VfcPj(5)| < \zj -hj\, 


and 

Iffi W - - bj)\ = \gj{b) - gj{z) - (Vgjiz), b - z)\ < -{zj - bjf 

so that all graph Stein set boundary compatibility constraints are satisfied. Hence, we have the 
containment ^ G\\-\\^,Q,Gt’ which implies the first advertised inequality. 


Establishing the second inequality To establish the second inequality, it suffices to show that for 
any g G 0|| ||j,Q,Gt’ j G d}, and C — there exists a function gj satisfying 


9jiz) = 9jiz), V5j(2) = \^gj{z), 9j{b) = 0, Vfcgj(5) = 0, Vfc 7^ j, (12) 

\9j{b)-9]iz)\<\\b-z\\j^, (13) 

||V5,(6) - V9j{z)\\^ < C\\b-z\\„ \\Vg,ib) - Vp,(&')IL < Cll^- (14) 

\9j{b) - 9j{z) - (^ 9 j{z),b-z)\ < ^\\b - z\\l, (15) 

\9j{z)-gjib)-{^9jib),z-b)\<Y\\b-z\\l, and (16) 

\9Ab) - 9 j(.b') - {Vg,{b'),b - b')\ < ^\\b - b'\\l (17) 

for all z G supp((5) and all b, b' in the j-th coordinate boundary set 


Bj = {bGW^:b = z + ej{aj — Zj) or b = z + ej{Pj — Zj) for some z G X}. 

Indeed, since such gj will satisfy max(|( 7 j(z)|, || (z)||^) < 1 for all z G supp((3) U Bj and 

l9i(^)-Sj(y)l \\'^9ji^)-'^9j{y)\\^ \9.iix)-g.iiv)-'^g3(^)(^-y)\ \9ji^)-gj(y)-'^gjiv)i^-y)\\ ^ 0^2 

ll^-9lli ’ ll^-9lli ’ h\\^-y\\l ’ hW^-vWl J- 

for all x,y G supp((5) by the argument of Appendix]^ the Whitney-Glaeser extension theorem ifTSl 
Thm. 1.4] of Glaeser fTU will then imply that there exists g* G t'^KdG\\-\\^, for a constant 
independent of g depending only on d, with g*{z) — g{z) and Vg*{z) = ^9{z) for all z G 
supp(Q). Since g and g* will have matching Stein discrepancy objective values, and each objective 
is linear in g, the second advertised inequality will then follow. 

Fix g G G\\-\\^,Q,Gt j S {!) ■ ■ •) d}. We will now construct a function gj satisfying the desired 
properties. Since gj and Vg , ar e determined on suppjQ), and gj and V^g^ are determined on Bj 
for fc 7 ^ j by the constraints |T^, it remains to define Vjgj on Bj. We choose the extension 


\/jgj{b)= min {Vjgj(z)+Clk-6|li} for all bGB^. 

zGsupp{Q) 


Fix any z G supp((5) and b G Bj, and let b* = z + ej{bj — Zj). The argument of Appendixj^ 
implies that V^g^ is ^-Lipschitz on supp((3), and hence it is also (^-Lipschitz on supp((5) U Bj. 
Since 


|Vfcgj(z) - Vfcgj(6)| = \Vkgj{z)\ < \zj - bj\ < ||z - b\\^ 


for all k 7^ j, we have (141. Moreover, the boundary compatibility constraints of G\\.\\^,Q,Gt imply 


\9jib) - 9j{z)\ = \gj{z)\ < \\b* - z||i < \\b- z\\^, 


establishing ( [T3| ). We next invoke the triangle inequality, the boundary compatibility conditions 
of G\\.\\^,Q,Gt’ Holder’s inequality, the Lipschitz derivative property d, and the fact ||z — 6|lj^ = 
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II6* — z\\j^ + II6* — 6||j^ in turn to establish ( [TS] ): 

\9jib) - 9j{z) - (^ 9 j{z):b- z)\ = \gj{z) - \^jgj{z){zj - bj) - (Vgj{z),b* - b)\ 

< \9j{z) - ^jgjiz){zj - bj)\ + |(V 5 j(&*) - \7gj{z),b* - b)\ 

< l\\b* - ^11? + ||V5,(6*) - Vgj{z)\\J\b* - b\\, 
<l\\b*-z\\l + C\\b*-z\\,\\b*-b\\, 

<^{\\b*-z\\, + \\b*-b\\,r = ^\\b-z\\l 

A parallel argument yields ( [T7] i. Finally, we may deduce ( [Tb] ), as 

\ 9 jiz) - gj{b) - {Vgj{b),z -b)\< \gj{z) - Vjgj{z){zj - bj)\ + \Vjgj{b) - yjgj(z)\\zj - bj\ 

< l(Zj -bjf+ C\\b-z\\^\Zj -bj\ < y||&-2||i 

by the triangle inequality, the definition of G\\-\\^,Q,Gt’ Lipschitz property ( |T^ . □ 
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